Bi-Encoder / Cross-Encoder

2025-03-30

11:26

    Characteristics of Sentence Transformer (a.k.a. bi-encoder) models:

    1. Calculates a fixed-size vector representation (embedding) given texts or images.
    2. Embedding calculation is often efficient, and embedding similarity calculation is very fast.
    3. Applicable to a wide range of tasks, such as semantic textual similarity, semantic search, clustering, classification, paraphrase mining, and more.
    4. Often used as the first step in a two-step retrieval process, where a Cross-Encoder (a.k.a. reranker) model re-ranks the top-k results from the bi-encoder.

     

    from sentence_transformers import SentenceTransformer

    # 1. Load a pretrained Sentence Transformer model
    model = SentenceTransformer("all-MiniLM-L6-v2")

    # The sentences to encode
    sentences = [
        "The weather is lovely today.",
        "It's so sunny outside!",
        "He drove to the stadium.",
    ]

    # 2. Calculate embeddings by calling model.encode()
    embeddings = model.encode(sentences)
    print(embeddings.shape)
    # [3, 384]

    # 3. Calculate the embedding similarities
    similarities = model.similarity(embeddings, embeddings)
    print(similarities)
    # tensor([[1.0000, 0.6660, 0.1046],
    #         [0.6660, 1.0000, 0.1411],
    #         [0.1046, 0.1411, 1.0000]])
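    The same bi-encoder embeddings can be reused for semantic search over a corpus. Below is a minimal sketch, assuming the all-MiniLM-L6-v2 model from above and the semantic_search helper in sentence_transformers.util; the corpus and query are made-up examples:

    from sentence_transformers import SentenceTransformer, util

    model = SentenceTransformer("all-MiniLM-L6-v2")

    # Hypothetical corpus and query, for illustration only
    corpus = [
        "A man is eating food.",
        "A man is riding a horse.",
        "A cheetah is running behind its prey.",
    ]
    query = "Someone is having a meal."

    # Encode the corpus once; these embeddings can be cached and reused for every query
    corpus_embeddings = model.encode(corpus, convert_to_tensor=True)
    query_embedding = model.encode(query, convert_to_tensor=True)

    # Retrieve the top-2 most similar corpus entries by cosine similarity
    hits = util.semantic_search(query_embedding, corpus_embeddings, top_k=2)[0]
    for hit in hits:
        print(f"{hit['score']:.4f}\t{corpus[hit['corpus_id']]}")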

     

    Characteristics of Cross Encoder (a.k.a. reranker) models:

    • Calculates a similarity score given pairs of texts.
    • Generally provides superior performance compared to a Sentence Transformer (a.k.a. bi-encoder) model.
    • Often slower than a Sentence Transformer model, as it requires a computation for each pair rather than for each text.
    • Due to the previous two characteristics, Cross Encoders are often used to re-rank the top-k results from a Sentence Transformer model.

     

    from sentence_transformers import CrossEncoder

    # 1. Load a pre-trained CrossEncoder model
    model = CrossEncoder("cross-encoder/ms-marco-MiniLM-L6-v2")

    # 2. Predict scores for a pair of sentences
    scores = model.predict([
        ("How many people live in Berlin?", "Berlin had a population of 3,520,031 registered inhabitants in an area of 891.82 square kilometers."),
        ("How many people live in Berlin?", "Berlin is well known for its museums."),
    ])
    # => array([ 8.607138 , -4.3200774], dtype=float32)

    # 3. Rank a list of passages for a query
    query = "How many people live in Berlin?"
    passages = [
        "Berlin had a population of 3,520,031 registered inhabitants in an area of 891.82 square kilometers.",
        "Berlin is well known for its museums.",
        "In 2014, the city state Berlin had 37,368 live births (+6.6%), a record number since 1991.",
        "The urban area of Berlin comprised about 4.1 million people in 2014, making it the seventh most populous urban area in the European Union.",
        "The city of Paris had a population of 2,165,423 people within its administrative city limits as of January 1, 2019",
        "An estimated 300,000-420,000 Muslims reside in Berlin, making up about 8-11 percent of the population.",
        "Berlin is subdivided into 12 boroughs or districts (Bezirke).",
        "In 2015, the total labour force in Berlin was 1.85 million.",
        "In 2013 around 600,000 Berliners were registered in one of the more than 2,300 sport and fitness clubs.",
        "Berlin has a yearly total of about 135 million day visitors, which puts it in third place among the most-visited city destinations in the European Union.",
    ]
    ranks = model.rank(query, passages)

    # Print the scores
    print("Query:", query)
    for rank in ranks:
        print(f"{rank['score']:.2f}\t{passages[rank['corpus_id']]}")
    """
    Query: How many people live in Berlin?
    8.92    The urban area of Berlin comprised about 4.1 million people in 2014, making it the seventh most populous urban area in the European Union.
    8.61    Berlin had a population of 3,520,031 registered inhabitants in an area of 891.82 square kilometers.
    8.24    An estimated 300,000-420,000 Muslims reside in Berlin, making up about 8-11 percent of the population.
    7.60    In 2014, the city state Berlin had 37,368 live births (+6.6%), a record number since 1991.
    6.35    In 2013 around 600,000 Berliners were registered in one of the more than 2,300 sport and fitness clubs.
    5.42    Berlin has a yearly total of about 135 million day visitors, which puts it in third place among the most-visited city destinations in the European Union.
    3.45    In 2015, the total labour force in Berlin was 1.85 million.
    0.33    Berlin is subdivided into 12 boroughs or districts (Bezirke).
    -4.24   The city of Paris had a population of 2,165,423 people within its administrative city limits as of January 1, 2019
    -4.32   Berlin is well known for its museums.
    """

     

     

    Bi-encoder:

    • Architecture: In a bi-encoder model, there are two separate encoders: one for encoding the input query and another for encoding the candidate documents. These encoders work independently, producing embeddings for the query and for each document. In practice this is usually a single encoder with shared weights applied to both sides.
    • Training: During training, the model is trained to maximize the similarity between the query and the relevant document while minimizing the similarity between the query and irrelevant documents. Training is often done with a contrastive loss function (see the training sketch after this list).
    • Scoring: At inference time, the model calculates the similarity score between the query and each document independently. The document with the highest similarity score is considered the most relevant.
    • Use Cases: Bi-encoders are commonly used in tasks where document retrieval or ranking is the primary goal, such as search engines or recommendation systems.
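    A minimal training sketch for the contrastive setup described above, using the sentence-transformers fit API and MultipleNegativesRankingLoss (matching pairs are pulled together, the other passages in the batch serve as negatives); the (query, passage) pairs are made up for illustration:

    from sentence_transformers import SentenceTransformer, InputExample, losses
    from torch.utils.data import DataLoader

    model = SentenceTransformer("all-MiniLM-L6-v2")

    # Hypothetical (query, relevant passage) pairs; within each batch, the
    # passages of the other pairs act as in-batch negatives
    train_examples = [
        InputExample(texts=["How many people live in Berlin?",
                            "Berlin has around 3.5 million registered inhabitants."]),
        InputExample(texts=["What is the capital of France?",
                            "Paris is the capital and largest city of France."]),
    ]
    train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=2)

    # Contrastive objective: maximize similarity of matching pairs,
    # minimize similarity to the other passages in the batch
    train_loss = losses.MultipleNegativesRankingLoss(model)

    model.fit(train_objectives=[(train_dataloader, train_loss)], epochs=1, warmup_steps=10)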

     

     

    Cross-Encoder:

    • Architecture: In a cross-encoder model, the query and document are processed together in a single encoder. The model takes both the query and the document as one input sequence and produces a joint representation.
    • Training: Like bi-encoders, cross-encoders are trained to score relevant query-document pairs higher than irrelevant ones. However, since they process the query and document together, they capture token-level interactions between the two.
    • Scoring: Cross-encoders generate a single similarity score for each query-document pair, taking the full interaction between the query and the document into account. The document with the highest score is considered the most relevant (see the scoring sketch after this list).
    • Use Cases: Cross-encoders are useful when capturing the interaction between the query and document is crucial, such as in tasks where understanding the context or relationship between the query and document is important.
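    To make the joint encoding concrete, the sketch below loads the same reranker checkpoint directly with Hugging Face transformers: the query and passage are tokenized together into a single [CLS] query [SEP] passage [SEP] sequence, and one forward pass yields one relevance logit. This is roughly what CrossEncoder.predict does internally:

    import torch
    from transformers import AutoTokenizer, AutoModelForSequenceClassification

    # Same reranker checkpoint as in the example above
    name = "cross-encoder/ms-marco-MiniLM-L6-v2"
    tokenizer = AutoTokenizer.from_pretrained(name)
    model = AutoModelForSequenceClassification.from_pretrained(name)
    model.eval()

    query = "How many people live in Berlin?"
    passage = "Berlin had a population of 3,520,031 registered inhabitants."

    # Query and passage are concatenated into one input sequence;
    # one joint forward pass per pair produces one relevance logit
    inputs = tokenizer(query, passage, return_tensors="pt", truncation=True)
    with torch.no_grad():
        score = model(**inputs).logits.squeeze().item()
    print(score)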

     

     

    When to Use Which One:

    • Bi-Encoder: Use bi-encoders when you need to search or rank over large document collections. They are much faster during inference, since document embeddings can be precomputed and similarity scores are computed independently for each document. They are suitable for tasks where capturing complex interactions between the query and document is less critical.
    • Cross-Encoder: Choose cross-encoders when capturing interactions between the query and document is crucial for your task. They are more computationally intensive, but can provide better performance in scenarios where understanding the context or relationship between the query and document is essential. In practice the two are often combined in a retrieve-and-rerank pipeline, as sketched below.
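    A minimal retrieve-and-rerank sketch combining the two models from the examples above; the corpus here is a small subset of the Berlin passages, and top_k=3 is an arbitrary choice:

    from sentence_transformers import SentenceTransformer, CrossEncoder, util

    bi_encoder = SentenceTransformer("all-MiniLM-L6-v2")
    cross_encoder = CrossEncoder("cross-encoder/ms-marco-MiniLM-L6-v2")

    query = "How many people live in Berlin?"
    corpus = [
        "Berlin had a population of 3,520,031 registered inhabitants in an area of 891.82 square kilometers.",
        "Berlin is well known for its museums.",
        "In 2015, the total labour force in Berlin was 1.85 million.",
        "The city of Paris had a population of 2,165,423 people within its administrative city limits as of January 1, 2019",
    ]

    # Step 1: bi-encoder retrieval (cheap; corpus embeddings can be precomputed)
    corpus_embeddings = bi_encoder.encode(corpus, convert_to_tensor=True)
    query_embedding = bi_encoder.encode(query, convert_to_tensor=True)
    hits = util.semantic_search(query_embedding, corpus_embeddings, top_k=3)[0]

    # Step 2: cross-encoder re-ranking (expensive; applied only to the top-k hits)
    pairs = [(query, corpus[hit["corpus_id"]]) for hit in hits]
    rerank_scores = cross_encoder.predict(pairs)

    for score, (_, passage) in sorted(zip(rerank_scores, pairs), key=lambda x: x[0], reverse=True):
        print(f"{score:.2f}\t{passage}")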

     

     

     

    Summary: A bi-encoder takes a single sentence as input and outputs an embedding for that sentence; similarity is computed by comparing the distance between the two sentences' embeddings.

    A cross-encoder takes a sentence pair, i.e. two sentences, as input and outputs a similarity score for the pair, so it can capture the interaction between the two sentences.

 
